CUDA Programming Guide: Beyond Streams: The Modern CUDA Optimization Landscape

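The modern CUDA optimization landscape represents a paradigm shift from traditional, CPU-bound stream execution toward an autonomous, hardware-accelerated ecosystem. By handing memory allocation, synchronization, and kernel scheduling directly to the GPU hardware, this shift greatly reduces host-side load.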

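Optimization starts at the driver. Modern applications use cuInit and cuModuleLoad to manage modules. A key feature is lazy loading (CUDA_MODULE_LOADING=LAZY): functions are loaded into the GPU context only when they are first called, which reduces both memory footprint and startup latency.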
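As a minimal sketch of the lazy-loading setup described above, the following host program requests lazy loading and then asks the driver which mode is actually in effect. It assumes CUDA 11.7 or later (where cuModuleGetLoadingMode is available), a POSIX system for setenv, and linking against the driver library (for example with -lcuda). In practice the variable is usually exported in the shell rather than set in code.

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // Assumption: the variable must be set before the driver is initialized.
    // Normally this is done in the shell: export CUDA_MODULE_LOADING=LAZY
    setenv("CUDA_MODULE_LOADING", "LAZY", 1);  // POSIX only

    // Initialize the driver API; the flags argument must be 0.
    if (cuInit(0) != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed\n");
        return 1;
    }

    // Ask the driver which module loading mode it is using.
    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) != CUDA_SUCCESS) {
        fprintf(stderr, "cuModuleGetLoadingMode failed\n");
        return 1;
    }

    printf("Module loading mode: %s\n",
           mode == CU_MODULE_LAZY_LOADING ? "lazy" : "eager");
    return 0;
}
```

Querying the mode instead of assuming it is useful because the driver may ignore the request on configurations that do not support lazy loading.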

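CUDA code ships as PTX (Parallel Thread Execution), a virtual instruction set, and as cubin, the device binary format. At run time the just-in-time compiler translates PTX into code optimized for the architecture-specific feature set of the target GPU. For example, an application compiled with CUDA 11.3 can run on an 11.4 driver without recompilation, because the driver API is backward compatible: applications built against an older version keep working on newer drivers.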
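The compatibility claim above can be checked at run time. The sketch below compares the CUDA version supported by the installed driver with the runtime version the program was built against, using cudaDriverGetVersion and cudaRuntimeGetVersion. Treating "driver version at least as new as runtime version" as the pass condition is a conservative assumption for illustration; some toolkit releases also allow other combinations.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int driverVersion = 0;
    int runtimeVersion = 0;

    // Both calls encode the version as 1000 * major + 10 * minor.
    if (cudaDriverGetVersion(&driverVersion) != cudaSuccess ||
        cudaRuntimeGetVersion(&runtimeVersion) != cudaSuccess) {
        fprintf(stderr, "version query failed\n");
        return 1;
    }

    printf("Driver supports CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("Runtime built for CUDA %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);

    // Conservative check (an assumption of this sketch): a driver that is
    // at least as new as the runtime can run the application as built.
    if (driverVersion >= runtimeVersion) {
        printf("Driver is new enough for this build.\n");
    } else {
        printf("Driver is older than the runtime; compatibility is not guaranteed.\n");
    }
    return 0;
}
```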

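Modern execution is governed by a strict resource mapping between parameter buffers (PB) and thread blocks (TB). This can be expressed mathematically as: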

$$PB = \{BP_0, BP_1, \dots, BP_L\}, \quad TB = \{BT_0, BT_1, \dots, BT_L\}$$

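where hardware constraint validation ensures that $$BT_n \le BP_m$$ holds for $$n \le m$$. This architecture allows autonomous launches through cudaLaunchDevice without exceeding hardware limits.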

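Optimization now requires global visibility into managed data. Primitives such as cudaMemPrefetchAsync and the system allocator let the GPU stage data before a kernel runs, removing synchronization bottlenecks on heterogeneous platforms that combine Arm CPUs with NVIDIA GPUs.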
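The prefetch pattern described above might look like the following sketch. It allocates managed memory, initializes it on the host, prefetches it to the GPU before the kernel runs, and prefetches it back afterwards. The kernel `scale` and the buffer size are hypothetical, chosen only for illustration. The sketch assumes the cudaMemPrefetchAsync signature that takes an integer destination device, as found in CUDA 12.x and earlier; newer toolkits describe the destination with a cudaMemLocation instead. It also assumes a device that supports concurrent managed access.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;

    // Managed memory is accessible from both host and device.
    float* data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // pages first touched on the host

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    // Move the pages to the GPU ahead of the kernel instead of relying on
    // page faults during execution. This call can fail on devices without
    // concurrent managed access, so the result is checked.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, device, stream);
    if (err != cudaSuccess) {
        fprintf(stderr, "prefetch to device failed: %s\n", cudaGetErrorString(err));
    }

    scale<<<(n + 255) / 256, 256, 0, stream>>>(data, n, 2.0f);

    // Bring the results back to host memory before the CPU reads them.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    printf("data[0] = %f\n", data[0]);

    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}
```

Issuing both prefetches on the same stream as the kernel keeps the ordering explicit: the data arrives before the kernel starts and returns only after it finishes.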

QUESTION 1

What is the primary benefit of setting CUDA_MODULE_LOADING=LAZY?

It increases the clock speed of the GPU cores.

It loads functions into the GPU context only when they are first invoked.

It disables all error checking for faster execution.

It forces the CPU to handle all memory allocations.

QUESTION 2

Which mathematical condition ensures that autonomous launches stay within hardware limits?

$$BT_n > BP_m$$

$$BT_n \le BP_m$$ for $$n \le m$$

$$PB + TB = 0$$

$$L = 0$$

QUESTION 3

What does cudaMemPrefetchAsync do in the modern optimization landscape?

It deletes unused memory on the host.

It proactively moves data to the GPU before a kernel uses it.

It compiles PTX code into cubin.

It synchronizes all CPU threads.

QUESTION 4

What is the role of PTX (Parallel Thread Execution) in CUDA?

It is the physical hardware architecture.

It is a low-level virtual machine and instruction set for JIT compilation.

It is a tool for debugging memory leaks.

It is a host-side library for file I/O.

QUESTION 5

How do CUDA Graphs improve performance over traditional stream-based execution?

By increasing the number of available CUDA cores.

By reducing CPU-to-GPU launch overhead through 'baked' execution sequences.

By automatically converting C++ code to Python.

By disabling the need for GPU memory.